Cartography of Genomic Interactions Enables Deep Analysis of Single-Cell Expression Data

Remarkable advances in single cell genomics have presented unique challenges and opportunities for interrogating a wealth of biomedical inquiries. High dimensional genomic data are inherently complex because of intertwined relationships among the genes. Existing methods, including emerging deep learning-based approaches, do not consider the underlying biological characteristics during data processing, which greatly compromises the performance of data analysis and hinders the maximal utilization of state-of-the-art genomic techniques. In this work, we develop an entropy-based cartography strategy to contrive the high dimensional gene expression data into a configured image format, referred to as genomap, with explicit integration of the genomic interactions. This unique cartography casts the gene-gene interactions into the spatial configuration of genomaps and enables us to extract the deep genomic interaction features and discover underlying discriminative patterns of the data. We show that, for a wide variety of applications (cell clustering and recognition, gene signature extraction, single cell data integration, cellular trajectory analysis, dimensionality reduction, and visualization), the proposed approach drastically improves the accuracies of data analyses as compared to the state-of-the-art techniques.


Title: Cartography of Genomic Interactions Enables Deep Analysis of Single-Cell Expression Data
The authors wish to thank the editor and referees for their constructive comments. The manuscript has been extensively revised to address the questions raised by the referees, as detailed below.
Reviewer #1 (Remarks to the Author): In this version, the authors refined the notations and added some new results. In general, the quality of the paper has been improved. However, we are still not convinced of the necessity and novelty of Genomap. We also propose some analyses to validate the superiority of Genomap. Response: In this resubmission, we have added the analyses suggested by you to validate the superiority of Genomap for high-performance analysis of the gene expression data. We have also added additional mathematical analysis and results to show the necessity and novelty of Genomap as detailed below.
1. The authors used entropy formulas to justify why Genomap is more informative than the gene vector. However, the arguments are not convincing: the authors' 1D entropy formula assumes that genes are independent, which we know is not the case. The entropy should be defined based on genes' joint multivariate distribution, which can be estimated from the gene expression vectors and does not require the 2D-grid representation. Hence, the necessity of using the 2Dgrid representation + the blackbox CNN still lacks justification. Moreover, the 2D entropy formula is unclear: how is the p_{i,j}^k defined? This entropy definition is the theoretical foundation and thus needs more explanation. Response: In this resubmission, the entropy formulation used to explain the information increase when going from 1D to 2D-grid is updated upon your suggestion. The assumption of independence of genes is now not required. By following Refs. 1-4 , we provided a detailed analysis of the entropy of genomaps. Specifically, the entropy is now defined based on the joint multivariate distribution of the genes as suggested by you. It is seen from the revised formulation that the additional information (mutual entropy) in genomaps comes from the neighborhood information of the genes located in the 2D grid. Please see supplementary section 2 for details.
The spatially independent component of the entropy can be estimated from the gene expression vectors and does not require the 2D-grid representation. However, the spatial entropy (i.e. the mutual entropy) depends on the 2D grid representation. Please see supplementary section 2 and Refs. 1,2 for details on the procedure to compute the entropy in images. p_{i,j}^k is the joint probability of a gene located at the center of the k-th window and a gene positioned at (i,j) inside the window. Here, k=1 to n, n is the number of genes. We have revised the notation of the theoretical analysis of entropy in supplementary section 2 and added a detailed explanation of each of the terms.
2. Following the above comment, we have a suggestion for justifying the necessity of Genomap. Besides using the optimal transport to construct the map, the authors may also try a few other ways (e.g., the simplest way is a random 2D map; another way is in Comment 4). The authors can show both the entropy and the downstream analysis results will be worse than using the optimal transport. Response: We have added the entropy formulation and downstream performance analysis in terms of classification accuracy for both cases (random mapping and vec2image) as suggested by you in the revised manuscript (see supplementary section 2, Figs. 4-6, S4). It is seen that the genomap achieves much better performance in classification accuracy (>5% improvement) as compared to the random mapping and vec2image. Moreover, the entropy of both cases is worse than the genomaps constructed by using the optimal transport approach as shown in supplementary section 2 (see the last paragraph).
3. The authors claim that "We have added the neural network results in the revised manuscript, where we see that genomap+genonet improves the results by more than 5% in comparison with the neural network alone." We are not sure where the result is. Based on our guess, the authors are referring to the "ACTINN" method. Please make this clear in the manuscript; for example, why the authors use the ACTINN in the comparison. In addition, it is worth noting that ACTINN always leads to 2nd best results in all comparisons. The improvement of ACTINN vs. other methods is more significant than the improvement of Genomap vs. ACTINN. Considering that ACTINN is a neural network method developed by another group and not refined by the authors (note that there are many other neural network methods for classifying cells), we are not convinced that Genomap, rather than a neural network, is the main driver for improved classification accuracy. Response: The reviewer is correct that we referred to ACTINN as the neural network-based method, which has been clarified in the revised manuscript. We have used this technique because it is state-of-the-art (published in 2020) and used widely (cited by more than 99 times). Moreover, the authors of the work have made the code of ACTINN public, which we have used in our benchmarking.
To prove that genomap is the main driver for improved classification accuracy, we have added results (supplementary Fig. S5) for genoNet with 1D gene expression data input (1D expression+genoNet). In this case, when compared to genomap+genoNet, the only difference is the use or not use of the genomap. It is seen that the genomap+genoNet achieves much better performance (>5% improvement in classification accuracy) than 1D expression+genoNet. This indicates that the introduction of genomap improves classification accuracy. Furthermore, we also added results of genomap+ACTINN (with 2D input), which shows that the use of the genomap improves the classification accuracy by at least 5% in comparison with ACTINN with 1D input. 4. We also found a related paper published in Briefings in Bioinformatics: Vec2image: an explainable artificial intelligence model for the feature representation and classification of highdimensional biological data by vector-to-image conversion: https://academic.oup.com/bib/article/23/2/bbab584/6518046. Although this paper used tSNE instead of the optimal transport, the goal is also to convert a gene vector into an image and then use CNN for downstream analysis. Therefore, the authors need to justify the novelty of Genomap compared to this published article.

Response:
We have added the following discussion in the revised manuscript. "We note that there were a few attempts in the literature to convert gene expression data into images for deep learning [5][6][7][8] . In these methods, however, an image is usually made heuristically by projecting the HD data onto a 2D plane without any explicit constraint(s) on the spatial locations of the genes. As an example, In vec2image 8 , t-SNE (or other) embedding method is used to create the images. As a result, similar genes get clustered under the assumption of tdistribution without explicit constraints on their spatial positions for maximizing entropy. Therefore, many genes located in the outer region of the clusters have fewer neighbors than those located in the center of the cluster. The same is true for those non-clustered (i.e. isolated) genes. The reduction of the number of neighbors to many genes deteriorates the information extraction efficiency of the CNNs (because of the limited receptive field of CNN ). In contrast, the genomap here is established on a solid theoretical foundation with the goal of finding the optimal spatial configurational representation of the gene-gene interactions of the system. Thus, as shown in a variety of applications in the Results section, the deep analysis of genomaps leads to discoveries of highly distinctive patterns of the complex genomic data and provides a potentially useful analytic technique." We have also added the results of the vec2image method to the revised manuscript (Figs. 4-6, S4) and showed that the genomap+genoNet outperforms the vec2image method significantly (the accuracy improvement is at least 5%).

Reviewer #2 (Remarks to the Author):
Compared to previous version, the authors have improved their manuscript significantly, especially the description of methods. However, I still have some questions about evaluations. 1. I am quite confused about how to get the visualizations of the raw data in Figs. 4, 6 and 11. These results of raw data are too bad to believe. Were all the genes used to perform PCA, UMAP or tSNE directly? Was any dimension reduction done before UMAP or tSNE? The visualization for raw data should be performed after highly variable gene selection and dimension reduction such as PCA. In addition, in single cell RNA sequencing data analysis, PCA is just used for dimension reduction, not for visualization. Response: The protocol suggested by the reviewer was followed in our analysis and visualization. That is, we performed the visualization of the raw data after selection of the highly variable genes (HVGs) and dimensional reduction. We have clarified this issue in the revised manuscript. Please see the last paragraph of "Implementation and parameter settings" section.
Following the best practice in bioinformatics community 9 , only the HVGs were used in our analysis. PCA was done before t-SNE and UMAP. Based on your suggestion, we have removed the visualization results of PCA from Figures 4, 6 and 11.
2. In Fig. 8, what is the 10 most variable genes? Why were these genes selected? Why dont just select the marker genes for these cell types such as CD3, CD79 and so on? From the Fig.  8, Cell ID got totally random results on these genes, which is unreasonable. Response: We have selected the 10 most variable genes (highly variable genes-HVGs) because these genes contain the most information about the cell types. Gene names are shown in Fig. 8. In the revised supplementary (Fig. S21), we have added results for 6 CD (cluster of differentiation) marker genes which appeared in the first 1000 HVGs, where again we see that genomap outperforms Cell-ID in computing the gene activity. Please note that expressions of CD3 and CD79 genes were not included in the original dataset provided by the authors of Ref. 10 . Please see Ref. 10 for the description of the dataset and the gene list. We have double checked the results of cell ID for both HVGs and marker genes and found that the results on these genes are inferior to that of the genomap+genoNet. However, we believe that the index "Pearson correlation coefficient" we were using for comparing the gene activity may not be appropriate because of the assumption of Gaussian distribution of the data. In Ref. 11 , Kowalski's analysis concludes that the distribution of Pearson correlation coefficient is not robust in the presence of non-normality. Therefore, in the revised manuscript, we used cosine similarity 12,13 which is a popular and established index to quantify similarities between two gene expression vectors.
3. In Fig. 11, the authors should select popular single cell data analysis methods to compare with their method, such as PCA for dimension reduction (NOT for visualization), Louvain or Leiden for clustering. LDA, Siamese network, supervised UMAP, these methods are seldom used for real single cell data analysis. Response: Based on your suggestions, we have included the results for PCA as dimensionality reduction, Louvain for clustering in Figure 11. Once again, genomap+genoNet performs significantly better than the available methods. We have removed the visualization results of PCA from Figures 4,6 and 11 per your suggestion. and CD79B genes in B-cells. It is reasonable as CD3D, CD3E and CD79A, CD79B are established markers for T-cells and B-cells, respectively 2 . However, our approach shows much lower activities of these genes in other cells such as Monocyte, NK cell and Pneumocyte, which aligns well with the ground truth sci-ATAC-seq data. Again, our approach significantly outperforms Cell-ID when their performances are compared with the ground truth in terms of cosine similarity.
3. The cell type labels used in the manuscript are from the original paper. However, these labels are based on the cluster results and not the ground truth. So, it is hard to say that genomap+genoNet performs significantly better than other methods such as ACTINN and Vec2image as their accuracies are comparable. These small accuracy differences between these methods might come from the inaccurate labels. Response: We used the labels from the original paper to benchmark our method against 11 existing methods. Here, we note that the labels created in the original publication are based on both clusters and marker genes (see pages 1310 and 1311 of Ref. 1), which is a widely used technique in cell labeling. Our method outperforms ACTINN and Vec2image by 5% to 10% in cell classification. This performance improvement should be considered significant even in the presence of slight inaccuracies of the cell labeling. Here, we note that in some of our accuracy plots, the actual accuracy improvement may not look significant because of the large range of yaxis (generally 0-100). As an example, below we add Fig. 6(b), where the accuracies of our approach, Random map and Vec2Image may look comparable. However, in reality, the accuracy of our approach is 87%, whereas Random map and Vec2Image have accuracies of 81% and 82%. For convenience, we have added the source data of our accuracy plots as excel files in this resubmission. Moreover, we have reduced the range of y-axis for better visualization of the improvement in accuracy, where possible (such as Figs. 4c and 5).